1 Business Understanding

For this project I used the database from Inside.
You can also find the dataset here.
In this project I want to answer the following questions:
- Which variables have the greatest influence on the price of an Airbnb listing?
- What is the best time of year to rent an Airbnb in Buenos Aires?
- Plus other insights that come up as we explore the dataset.

2 Data Understanding

2.1 Installing the required packages

pacotes <- c("plotly","tidyverse","ggrepel","fastDummies","knitr","kableExtra",
             "splines","reshape2","PerformanceAnalytics","correlation","see",
             "ggraph", "car", "olsrr", "jtools", "ggside", "ggplot2", "tidyquant", "DT")


options(rgl.debug = TRUE)

# install whichever packages are missing, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
##               plotly            tidyverse              ggrepel 
##                 TRUE                 TRUE                 TRUE 
##          fastDummies                knitr           kableExtra 
##                 TRUE                 TRUE                 TRUE 
##              splines             reshape2 PerformanceAnalytics 
##                 TRUE                 TRUE                 TRUE 
##          correlation                  see               ggraph 
##                 TRUE                 TRUE                 TRUE 
##                  car                olsrr               jtools 
##                 TRUE                 TRUE                 TRUE 
##               ggside              ggplot2            tidyquant 
##                 TRUE                 TRUE                 TRUE 
##                   DT 
##                 TRUE

2.2 Loading data

listing_df <- read_csv('data/listings.csv') # full Airbnb listings dataset for Buenos Aires
## Rows: 22713 Columns: 75
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (25): listing_url, source, name, description, neighborhood_overview, pi...
## dbl  (37): id, scrape_id, host_id, host_listings_count, host_total_listings_...
## lgl   (8): host_is_superhost, host_has_profile_pic, host_identity_verified, ...
## date  (5): last_scraped, host_since, calendar_last_scraped, first_review, la...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Below are the 15 neighbourhoods with the largest number of listings

listing_df %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(qtd_bairros = n()) %>% 
  slice_max(qtd_bairros, n=15) %>%
  mutate(neighbourhood_cleansed = reorder(neighbourhood_cleansed, -qtd_bairros)) %>% 
  ggplot(aes(x = neighbourhood_cleansed, y = qtd_bairros, fill=neighbourhood_cleansed)) +
  theme(axis.text.x = element_text(angle = 90))+
  geom_col()

If you go to Buenos Aires, these are the neighbourhoods where you are most likely to find accommodation.

Average price per neighbourhood, considering the 5 most expensive
- To compute the average, we first need to adjust the "price" column, removing the $ and the comma and converting it to numeric.

listing_df$price <- str_replace_all(listing_df$price,'[$]','')
listing_df$price <- str_replace_all(listing_df$price,',','')
listing_df$price <- as.numeric(listing_df$price)
listing_df %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(avg_price = mean(price)) %>%
  slice_max(avg_price, n=5) %>% 
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F,
                font_size = 22)
neighbourhood_cleansed   avg_price
Coghlan                  260997.61
Puerto Madero             25266.96
Barracas                  23476.08
San Telmo                 20736.60
Monte Castro              16837.25
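
The price clean-up above could also be wrapped in a small helper so that the same steps can be reused for other price columns. This is only a sketch: `parse_price` is a hypothetical name that does not appear in the original analysis, and it assumes prices are formatted like "$1,234.00".

```r
# Hypothetical helper (not part of the original analysis): strip "$" and
# thousands separators from a price string and convert it to numeric.
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Missing values stay missing
parse_price(c("$1,234.00", "$85.50", NA))
```

Applied to the listings, this would be used as `listing_df$price <- parse_price(listing_df$price)`.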

Considering the available room types (room_type), what is the average price for each of them?

listing_df %>%
  group_by(room_type) %>%
  summarise(avg_price = mean(price)) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F,
                font_size = 22)
room_type         avg_price
Entire home/apt    15681.38
Hotel room         51544.88
Private room       11035.68
Shared room        24706.32

As seen above, hotel rooms are on average the most expensive option in Buenos Aires. However, it is a bit odd that a private room is cheaper than a shared room, don't you think? Let's investigate.
Let's see how prices are distributed for each room type; perhaps a few outliers are distorting the mean of each category.

ggplotly(
  ggplot(listing_df, aes(x = room_type, y = price)) +
    geom_point(color = "#39568CFF", size = 2.5) +
    labs(x = "Type_room", y = "Price") +
    theme_classic()
)

BINGO! Since the mean is being distorted by outliers, I will look at the median, which is less sensitive to outliers and can give a more representative value.

listing_df %>%
  group_by(room_type) %>%
  summarise(median_price = median(price)) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F,
                font_size = 22)
room_type         median_price
Entire home/apt           9318
Hotel room                8142
Private room              4763
Shared room               4000

Now we can conclude that shared rooms are the cheapest. It all makes more sense now, doesn't it?
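
To see why the median is the safer summary here, consider a small illustration. The values below are made up for the example and are not taken from the dataset.

```r
# Toy nightly prices (made-up values): four typical listings and one extreme one
prices <- c(4000, 4500, 5000, 5500, 1000000)

mean_price   <- mean(prices)    # pulled far upward by the single extreme value
median_price <- median(prices)  # stays at the middle of the typical listings

print(c(mean = mean_price, median = median_price))
```

A single extreme listing is enough to move the mean far away from what a typical guest would pay, while the median barely notices it.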

2.3 Checking correlations

First, we need to make a few adjustments.

listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'bath','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'s','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'S','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'private','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'Private','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'hared','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'half-','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'Half-','')
listing_df$bathrooms_text <- as.numeric(listing_df$bathrooms_text)
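
As an alternative to the chained replacements above, the number could be extracted with a single regular expression. This is a sketch under assumptions: `parse_bathrooms` is a hypothetical helper, and treating a "half-bath" with no explicit number as 0.5 is my assumption. The chained version above appears to leave such entries as NA, so using this variant would change the figures reported later.

```r
# Hypothetical helper: turn strings such as "1.5 baths" or "Shared half-bath"
# into a number of bathrooms.
parse_bathrooms <- function(x) {
  # Keep the first run of digits (with optional decimal part); strings with
  # no digits are left unchanged by sub() and become NA in as.numeric()
  num <- suppressWarnings(
    as.numeric(sub("^[^0-9]*([0-9]+\\.?[0-9]*).*$", "\\1", x))
  )
  # Assumption: a "half-bath" with no explicit number counts as 0.5
  is_half <- is.na(num) & grepl("half", x, ignore.case = TRUE)
  num[is_half] <- 0.5
  num
}

parse_bathrooms(c("1.5 baths", "1 private bath", "Shared half-bath", NA))
```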

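Correlation between price, number of bathrooms, number of bedrooms and number of beds.

# select the columns by name rather than by position (41, 37, 38, 39)
chart.Correlation(listing_df[, c("price", "bathrooms_text", "bedrooms", "beds")], histogram = TRUE)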

Price appears to be related to the number of beds, bedrooms and bathrooms a listing has. However, although these correlations are statistically significant, they are not very high.

Another interesting question is:
- Do listing prices change significantly over the year?

To answer this question, let's use the calendar_df dataset.

calendar_df <- read.csv("data/calendar.csv") # price of each listing over a one-year period

The date and price variables need a few adjustments before we can run the analysis.

calendar_df$date <- as.Date(calendar_df$date)
calendar_df['month'] <- (format(calendar_df$date, '%Y-%m'))
calendar_df$month <- as.factor(calendar_df$month)
calendar_df$price <- str_replace_all(calendar_df$price,'[$]','')
calendar_df$price <- str_replace_all(calendar_df$price,',','')
calendar_df$price <- as.numeric(calendar_df$price)

Month vs. median price

price_month <- calendar_df %>%
  group_by(month) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(x = month, y = median_price, group=1)) +
  geom_line(color='grey') +
  geom_point() +
  guides(x = guide_axis(angle = 90)) +
  labs(x= 'Month', y= 'Median Price',
       title = 'Price per month') +
  theme_classic()
price_month
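
The month label and the per-month median can be checked on a tiny example. The data frame below is made up for illustration and uses only base R; `calendar_toy` and `median_by_month` are hypothetical names.

```r
# Toy calendar data (made-up values) with the same two columns used above
calendar_toy <- data.frame(
  date  = as.Date(c("2023-01-05", "2023-01-20", "2023-02-03",
                    "2023-02-15", "2023-02-28")),
  price = c(100, 300, 200, 400, 600)
)

# Year-month label, e.g. "2023-01"
calendar_toy$month <- format(calendar_toy$date, "%Y-%m")

# Median price per month
median_by_month <- aggregate(price ~ month, data = calendar_toy, FUN = median)
print(median_by_month)
```

Keeping the year in the label matters when the calendar spans two calendar years, since otherwise the same month from different years would be pooled together.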

3 Data Preparation

Dropping columns that are empty or that hold URLs, locations, free-text comments, or host names, since they will not be used in this analysis.

listing_df <- subset(listing_df, select = -c(id, listing_url, scrape_id, picture_url,host_id, host_url, 
                                             host_thumbnail_url, host_picture_url, neighbourhood_group_cleansed,
                                             review_scores_value, calendar_updated, license, bathrooms,neighbourhood,
                                             neighborhood_overview, host_neighbourhood, host_location, host_response_rate, 
                                             host_about,description, name, host_name, first_review, last_review))

3.1 Formatting columns

  • bedrooms has 3058 NA values. These listings may be studios, one-room apartments (quarto-sala) or something similar, where the owner considers that there is no separate bedroom, so I replace these NAs with 1.
  • beds also receives 1 in place of NA, since the bed may be a sofa bed and was therefore not counted as a bed.
  • bathrooms_text also receives 1.
  • Listings without number_of_reviews consequently have no reviews_per_month either, so I replace those NAs with 0.
listing_df$beds[is.na(listing_df$beds)] <- 1
listing_df$bedrooms[is.na(listing_df$bedrooms)] <- 1
listing_df$bathrooms_text[is.na(listing_df$bathrooms_text)] <- 1
listing_df$number_of_reviews[is.na(listing_df$number_of_reviews)] <- 0
listing_df$reviews_per_month[is.na(listing_df$reviews_per_month)] <- 0

3.2 Converting logical variables to binary

# check which variables in the dataset are logical
(to.replace <- names(which(sapply(listing_df, is.logical))))
## [1] "host_is_superhost"      "host_has_profile_pic"   "host_identity_verified"
## [4] "has_availability"       "instant_bookable"
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:xts':
## 
##     first, last
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
Cols <-  which(sapply(listing_df, is.logical))
setDT(listing_df)

for(j in Cols){
  set(listing_df, i=NULL, j=j, value= as.numeric(listing_df[[j]]))
}
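
The same conversion can be done without data.table. The following is a base R sketch on a made-up data frame, not the approach used in the original; the column names are borrowed from the dataset only for readability.

```r
# Toy data frame (made-up values) with two logical columns
toy <- data.frame(
  host_is_superhost = c(TRUE, FALSE, NA),
  instant_bookable  = c(FALSE, FALSE, TRUE),
  beds              = c(1, 2, 3)
)

# Convert every logical column to 0/1, leaving other columns untouched;
# NA stays NA
is_lgl <- vapply(toy, is.logical, logical(1))
toy[is_lgl] <- lapply(toy[is_lgl], as.numeric)

str(toy)
```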

3.3 Converting qualitative variables to factors

listing_df$source <- as.factor(listing_df$source)
listing_df$property_type <- as.factor(listing_df$property_type)
listing_df$host_response_time[listing_df$host_response_time == 'N/A'] <- "did not inform"
listing_df$host_response_time <- as.factor(listing_df$host_response_time)
listing_df$neighbourhood_cleansed <- as.factor(listing_df$neighbourhood_cleansed)
listing_df$room_type <- as.factor(listing_df$room_type)

3.4 Adjusting the variables:

  • host_verifications
  • amenities
  • host_acceptance_rate
  • date
listing_df$host_verifications[listing_df$host_verifications == '[]'] <- 1
listing_df$host_verifications <- as.factor(listing_df$host_verifications)
listing_df$host_verifications <- droplevels(listing_df$host_verifications, exclude = 1)
listing_df$amenities <- lengths(gregexpr(",", listing_df$amenities)) + 1L
listing_df$host_acceptance_rate <- str_remove_all(listing_df$host_acceptance_rate, '[%]')
listing_df$host_acceptance_rate <- as.numeric(listing_df$host_acceptance_rate)
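
The amenities count above is the number of commas plus one, which appears to return 2 for an empty list such as "[]" and over-counts when an amenity name itself contains a comma. A possible alternative is sketched below. `count_amenities` is a hypothetical helper, and it assumes every amenity is wrapped in double quotes with no escaped quotes inside the names; using it would change the amenities values reported later.

```r
# Hypothetical helper: count the items in a string such as '["Wifi", "Kitchen"]'.
# Assumption: every amenity is wrapped in double quotes, so the number of
# items is half the number of double-quote characters.
count_amenities <- function(x) {
  quotes <- gregexpr('"', x, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  n_quotes <- vapply(quotes, function(m) sum(m > 0), integer(1))
  n_quotes %/% 2L
}

count_amenities(c('["Wifi", "Kitchen", "Heating"]', '["Wifi"]', '[]'))
```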

After a brief analysis, I found that the variables below are not relevant, so let's drop them.

listing_df <- subset(listing_df, select = -c(last_scraped, calendar_last_scraped))

3.5 Handling outliers

Function created to identify outliers using the quartile (interquartile range) method

quartil <- function(column){
  
  q1 <- quantile(column, 0.25, na.rm = TRUE) # 1st quartile
  q3 <- quantile(column, 0.75, na.rm = TRUE) # 3rd quartile
  iq <- q3 - q1 # interquartile range
  lim_sup <- q3 + 1.5*iq # upper limit
  return(lim_sup)
}

Applying the function

max_beds<- quartil(listing_df$beds)
max_bedrooms <- quartil(listing_df$bedrooms)
max_bathrooms <- quartil(listing_df$bathrooms_text)
max_price <- quartil(listing_df$price)

Upper limit for each variable (values above it are treated as outliers)

print(paste("beds:",max_beds, "bedrooms:", max_bedrooms, "bathrooms:", max_bathrooms, "price:", max_price))
## [1] "beds: 3.5 bedrooms: 1 bathrooms: 2.25 price: 24068"
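
The upper limit is the third quartile plus 1.5 times the interquartile range. A worked example on a made-up vector (not taken from the dataset) shows how one value ends up flagged:

```r
# Toy vector (made-up values): six small counts and one extreme value
x <- c(1, 1, 1, 2, 2, 3, 10)

q1  <- unname(quantile(x, 0.25))  # first quartile
q3  <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                    # interquartile range
upper_limit <- q3 + 1.5 * iqr     # values above this are flagged

outliers <- x[x > upper_limit]
print(upper_limit)
print(outliers)
```

With R's default quantile method, the quartiles here are 1 and 2.5, so the upper limit is 4.75 and only the value 10 is flagged.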

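Now I replace every value that is above the upper limit estimated for its variable, rather than dropping the row: beds, bathrooms_text and price receive the column mean, and bedrooms is capped at 1. Note that this leaves bedrooms constant at 1.
Replacing the outliers in each column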

for (i in seq_along(listing_df$beds)){
  if (listing_df$beds[i] > 3.5){
    listing_df$beds[i] <- mean(listing_df$beds)
  } 
}
  
for (i in seq_along(listing_df$bedrooms)){
  if (listing_df$bedrooms[i] > 1){
    listing_df$bedrooms[i] <- 1
  } 
}

for (i in seq_along(listing_df$bathrooms_text)){
  if (listing_df$bathrooms_text[i] > 2.25){
    listing_df$bathrooms_text[i] <- mean(listing_df$bathrooms_text)
  } 
}

for (i in seq_along(listing_df$price)){
  if (listing_df$price[i] > 24068){
    listing_df$price[i] <- mean(listing_df$price)
  } 
}
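
The loops above can also be written as a vectorised replacement. The sketch below is a variation, not an equivalent: `replace_above` is a hypothetical helper that uses the mean of the values within the limit, whereas the loops recompute the mean of the whole column (remaining outliers included) at every step, so the results would not match exactly.

```r
# Hypothetical helper: replace values above `limit` with the mean of the
# values that are within the limit.
replace_above <- function(x, limit) {
  is_out <- !is.na(x) & x > limit
  x[is_out] <- mean(x[!is_out], na.rm = TRUE)
  x
}

# The value 100 is above the limit and becomes the mean of 1, 2 and 3
replace_above(c(1, 2, 3, 100), limit = 3.5)
```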

Notice how the outlying values have been removed.

boxplot(listing_df$bedrooms)

boxplot(listing_df$beds)

boxplot(listing_df$bathrooms_text)

boxplot(listing_df$price)

boxplot(listing_df$host_acceptance_rate)

Let's check whether there are still many NA values in our dataset

sapply(listing_df, function(x) sum(is.na(x)))
##                                       source 
##                                            0 
##                                   host_since 
##                                            0 
##                           host_response_time 
##                                            0 
##                         host_acceptance_rate 
##                                         2105 
##                            host_is_superhost 
##                                            0 
##                          host_listings_count 
##                                            0 
##                    host_total_listings_count 
##                                            0 
##                           host_verifications 
##                                           43 
##                         host_has_profile_pic 
##                                            0 
##                       host_identity_verified 
##                                            0 
##                       neighbourhood_cleansed 
##                                            0 
##                                     latitude 
##                                            0 
##                                    longitude 
##                                            0 
##                                property_type 
##                                            0 
##                                    room_type 
##                                            0 
##                                 accommodates 
##                                            0 
##                               bathrooms_text 
##                                            0 
##                                     bedrooms 
##                                            0 
##                                         beds 
##                                            0 
##                                    amenities 
##                                            0 
##                                        price 
##                                            0 
##                               minimum_nights 
##                                            0 
##                               maximum_nights 
##                                            0 
##                       minimum_minimum_nights 
##                                            0 
##                       maximum_minimum_nights 
##                                            0 
##                       minimum_maximum_nights 
##                                            0 
##                       maximum_maximum_nights 
##                                            0 
##                       minimum_nights_avg_ntm 
##                                            0 
##                       maximum_nights_avg_ntm 
##                                            0 
##                             has_availability 
##                                            0 
##                              availability_30 
##                                            0 
##                              availability_60 
##                                            0 
##                              availability_90 
##                                            0 
##                             availability_365 
##                                            0 
##                            number_of_reviews 
##                                            0 
##                        number_of_reviews_ltm 
##                                            0 
##                       number_of_reviews_l30d 
##                                            0 
##                         review_scores_rating 
##                                         4122 
##                       review_scores_accuracy 
##                                         4202 
##                    review_scores_cleanliness 
##                                         4202 
##                        review_scores_checkin 
##                                         4202 
##                  review_scores_communication 
##                                         4201 
##                       review_scores_location 
##                                         4201 
##                             instant_bookable 
##                                            0 
##               calculated_host_listings_count 
##                                            0 
##  calculated_host_listings_count_entire_homes 
##                                            0 
## calculated_host_listings_count_private_rooms 
##                                            0 
##  calculated_host_listings_count_shared_rooms 
##                                            0 
##                            reviews_per_month 
##                                            0

Let's handle the remaining missing values

listing_df$host_acceptance_rate[is.na(listing_df$host_acceptance_rate)] <- 77.077

listing_df <- listing_df[!is.na(listing_df$host_verifications),]
listing_df <- subset(listing_df, select = -c(review_scores_accuracy,review_scores_communication,review_scores_cleanliness, review_scores_location,review_scores_rating,review_scores_checkin))

The review score variables were dropped because they have a very high number of NAs, so keeping them could strongly influence the final result of the model.

3.6 Handling qualitative variables

  • This dataset has a significant number of qualitative predictors, some of them with more than 50 categories. They would need a specific treatment to reduce their dimensionality before being put back into the model. For that reason, in this analysis I drop these variables and keep only room_type (as dummies) in the model.
  • This makes it feasible to run the model. I am working on a future approach that includes these qualitative predictors, so that the two models can be compared on their predictive ability for this study.
  • Now I will dummy-encode room_type for the model and drop the other qualitative variables.
listing_df <- subset(listing_df, select = -c(source, host_since, host_response_time, host_verifications, neighbourhood_cleansed,property_type))
listing_df_1_dummies <- dummy_columns(.data = listing_df,
                                      select_columns = c("room_type"),
                                      remove_selected_columns = T,
                                      remove_most_frequent_dummy = T)
summary(listing_df_1_dummies)
##  host_acceptance_rate host_is_superhost host_listings_count
##  Min.   :  0.00       Min.   :0.0000    Min.   :   1.00    
##  1st Qu.: 77.08       1st Qu.:0.0000    1st Qu.:   1.00    
##  Median : 96.00       Median :0.0000    Median :   3.00    
##  Mean   : 84.21       Mean   :0.3266    Mean   :  17.58    
##  3rd Qu.:100.00       3rd Qu.:1.0000    3rd Qu.:  13.00    
##  Max.   :100.00       Max.   :1.0000    Max.   :1787.00    
##  host_total_listings_count host_has_profile_pic host_identity_verified
##  Min.   :   1.00           Min.   :0.000        Min.   :0.0000        
##  1st Qu.:   1.00           1st Qu.:1.000        1st Qu.:1.0000        
##  Median :   3.00           Median :1.000        Median :1.0000        
##  Mean   :  25.01           Mean   :0.982        Mean   :0.8727        
##  3rd Qu.:  17.00           3rd Qu.:1.000        3rd Qu.:1.0000        
##  Max.   :3160.00           Max.   :1.000        Max.   :1.0000        
##     latitude        longitude       accommodates    bathrooms_text     bedrooms
##  Min.   :-34.69   Min.   :-58.53   Min.   : 1.000   Min.   :0.000   Min.   :1  
##  1st Qu.:-34.60   1st Qu.:-58.44   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:1  
##  Median :-34.59   Median :-58.42   Median : 2.000   Median :1.000   Median :1  
##  Mean   :-34.59   Mean   :-58.42   Mean   : 2.874   Mean   :1.155   Mean   :1  
##  3rd Qu.:-34.58   3rd Qu.:-58.39   3rd Qu.: 4.000   3rd Qu.:1.179   3rd Qu.:1  
##  Max.   :-34.53   Max.   :-58.36   Max.   :16.000   Max.   :2.000   Max.   :1  
##       beds         amenities          price       minimum_nights    
##  Min.   :1.000   Min.   :  2.00   Min.   :  175   Min.   :   1.000  
##  1st Qu.:1.000   1st Qu.: 18.00   1st Qu.: 6388   1st Qu.:   2.000  
##  Median :1.000   Median : 30.00   Median : 8969   Median :   3.000  
##  Mean   :1.599   Mean   : 30.37   Mean   : 9765   Mean   :   6.826  
##  3rd Qu.:2.000   3rd Qu.: 42.00   3rd Qu.:12286   3rd Qu.:   5.000  
##  Max.   :3.000   Max.   :103.00   Max.   :24050   Max.   :1000.000  
##  maximum_nights    minimum_minimum_nights maximum_minimum_nights
##  Min.   :    1.0   Min.   :   1.000       Min.   :   1.00       
##  1st Qu.:   90.0   1st Qu.:   2.000       1st Qu.:   2.00       
##  Median :  365.0   Median :   3.000       Median :   3.00       
##  Mean   :  531.8   Mean   :   6.522       Mean   :   6.91       
##  3rd Qu.: 1125.0   3rd Qu.:   4.000       3rd Qu.:   5.00       
##  Max.   :99999.0   Max.   :1000.000       Max.   :1000.00       
##  minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm
##  Min.   :1.000e+00      Min.   :1.000e+00      Min.   :   1.000      
##  1st Qu.:3.650e+02      1st Qu.:3.650e+02      1st Qu.:   2.000      
##  Median :1.125e+03      Median :1.125e+03      Median :   3.000      
##  Mean   :6.638e+05      Mean   :6.638e+05      Mean   :   6.753      
##  3rd Qu.:1.125e+03      3rd Qu.:1.125e+03      3rd Qu.:   5.000      
##  Max.   :2.147e+09      Max.   :2.147e+09      Max.   :1000.000      
##  maximum_nights_avg_ntm has_availability availability_30 availability_60
##  Min.   :1.000e+00      Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:3.650e+02      1st Qu.:1.0000   1st Qu.: 0.00   1st Qu.:13.00  
##  Median :1.125e+03      Median :1.0000   Median :12.00   Median :37.00  
##  Mean   :6.638e+05      Mean   :0.9829   Mean   :12.41   Mean   :32.69  
##  3rd Qu.:1.125e+03      3rd Qu.:1.0000   3rd Qu.:22.00   3rd Qu.:51.00  
##  Max.   :2.147e+09      Max.   :1.0000   Max.   :30.00   Max.   :60.00  
##  availability_90 availability_365 number_of_reviews number_of_reviews_ltm
##  Min.   : 0.00   Min.   :  0.0    Min.   :  0.00    Min.   :  0.000      
##  1st Qu.:33.00   1st Qu.: 89.0    1st Qu.:  1.00    1st Qu.:  0.000      
##  Median :65.00   Median :247.0    Median :  8.00    Median :  4.000      
##  Mean   :55.56   Mean   :219.9    Mean   : 22.14    Mean   :  9.303      
##  3rd Qu.:81.00   3rd Qu.:344.0    3rd Qu.: 26.00    3rd Qu.: 13.000      
##  Max.   :90.00   Max.   :365.0    Max.   :637.00    Max.   :252.000      
##  number_of_reviews_l30d instant_bookable calculated_host_listings_count
##  Min.   : 0.0000        Min.   :0.0000   Min.   :  1.00                
##  1st Qu.: 0.0000        1st Qu.:0.0000   1st Qu.:  1.00                
##  Median : 0.0000        Median :0.0000   Median :  2.00                
##  Mean   : 0.9931        Mean   :0.2905   Mean   : 14.09                
##  3rd Qu.: 1.0000        3rd Qu.:1.0000   3rd Qu.: 10.00                
##  Max.   :44.0000        Max.   :1.0000   Max.   :150.00                
##  calculated_host_listings_count_entire_homes
##  Min.   :  0.00                             
##  1st Qu.:  1.00                             
##  Median :  2.00                             
##  Mean   : 13.27                             
##  3rd Qu.:  9.00                             
##  Max.   :150.00                             
##  calculated_host_listings_count_private_rooms
##  Min.   : 0.000                              
##  1st Qu.: 0.000                              
##  Median : 0.000                              
##  Mean   : 0.624                              
##  3rd Qu.: 0.000                              
##  Max.   :29.000                              
##  calculated_host_listings_count_shared_rooms reviews_per_month
##  Min.   : 0.00000                            Min.   : 0.000   
##  1st Qu.: 0.00000                            1st Qu.: 0.100   
##  Median : 0.00000                            Median : 0.670   
##  Mean   : 0.05748                            Mean   : 1.102   
##  3rd Qu.: 0.00000                            3rd Qu.: 1.650   
##  Max.   :16.00000                            Max.   :19.940   
##  room_type_Hotel room room_type_Private room room_type_Shared room
##  Min.   :0.000000     Min.   :0.00000        Min.   :0.000000     
##  1st Qu.:0.000000     1st Qu.:0.00000        1st Qu.:0.000000     
##  Median :0.000000     Median :0.00000        Median :0.000000     
##  Mean   :0.004455     Mean   :0.09479        Mean   :0.007411     
##  3rd Qu.:0.000000     3rd Qu.:0.00000        3rd Qu.:0.000000     
##  Max.   :1.000000     Max.   :1.00000        Max.   :1.000000
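
The dummy encoding with the most frequent category as the reference can also be illustrated in base R. This sketch uses a made-up factor and `model.matrix()` instead of `dummy_columns()`, so the column names differ from the ones in the summary above.

```r
# Toy factor (made-up values); "Entire home/apt" is the most frequent level
room_type <- factor(c("Entire home/apt", "Entire home/apt", "Private room",
                      "Shared room", "Entire home/apt"))

# Put the most frequent level first so it becomes the reference category
most_frequent <- names(which.max(table(room_type)))
room_type <- relevel(room_type, ref = most_frequent)

# model.matrix() builds one 0/1 column per non-reference level;
# the first column is the intercept, which is dropped here
dummies <- model.matrix(~ room_type)[, -1, drop = FALSE]
print(dummies)
```

Dropping one category avoids perfect collinearity with the intercept; the coefficients of the remaining dummies are then read as differences relative to the reference category.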

4 Modeling

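4.1 Estimating the multiple linear regression

4.1.1 Model with all variables

# fit call reconstructed from the "Call:" line of the output below
modelo_listing <- lm(formula = price ~ ., data = listing_df_1_dummies)
summary(modelo_listing)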
## 
## Call:
## lm(formula = price ~ ., data = listing_df_1_dummies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11373.6  -2589.0   -748.9   1802.8  17401.8 
## 
## Coefficients: (1 not defined because of singularities)
##                                                Estimate Std. Error t value
## (Intercept)                                   2.799e+06  1.070e+05  26.163
## host_acceptance_rate                         -6.904e+00  1.185e+00  -5.826
## host_is_superhost                             5.075e+02  6.020e+01   8.431
## host_listings_count                          -1.204e+01  2.374e+00  -5.071
## host_total_listings_count                     9.299e+00  1.315e+00   7.071
## host_has_profile_pic                         -3.246e+02  1.933e+02  -1.679
## host_identity_verified                       -8.115e+01  8.093e+01  -1.003
## latitude                                      4.485e+04  1.734e+03  25.872
## longitude                                     2.129e+04  1.041e+03  20.449
## accommodates                                  5.207e+02  1.978e+01  26.325
## bathrooms_text                                3.189e+03  8.749e+01  36.448
## bedrooms                                             NA         NA      NA
## beds                                          5.219e+02  4.051e+01  12.883
## amenities                                     3.264e+01  1.838e+00  17.760
## minimum_nights                                7.079e+00  2.555e+00   2.771
## maximum_nights                                2.390e-02  3.150e-02   0.759
## minimum_minimum_nights                        1.243e+01  1.318e+01   0.943
## maximum_minimum_nights                        2.577e+01  1.493e+01   1.726
## minimum_maximum_nights                        2.403e-01  3.759e-01   0.639
## maximum_maximum_nights                        1.711e+00  7.290e-01   2.348
## minimum_nights_avg_ntm                       -5.218e+01  2.330e+01  -2.240
## maximum_nights_avg_ntm                       -1.952e+00  8.934e-01  -2.184
## has_availability                             -1.448e+03  2.007e+02  -7.214
## availability_30                               6.864e+01  6.264e+00  10.959
## availability_60                              -4.802e+00  6.667e+00  -0.720
## availability_90                               1.643e+00  3.547e+00   0.463
## availability_365                              1.761e+00  2.365e-01   7.447
## number_of_reviews                            -7.715e-02  8.711e-01  -0.089
## number_of_reviews_ltm                        -4.908e+00  3.312e+00  -1.482
## number_of_reviews_l30d                       -7.401e+01  2.385e+01  -3.103
## instant_bookable                              2.557e+02  5.992e+01   4.266
## calculated_host_listings_count               -1.449e+02  1.720e+01  -8.420
## calculated_host_listings_count_entire_homes   1.591e+02  1.722e+01   9.242
## calculated_host_listings_count_private_rooms  8.776e+01  2.084e+01   4.212
## calculated_host_listings_count_shared_rooms  -4.866e+00  4.996e+01  -0.097
## reviews_per_month                            -3.523e+02  3.165e+01 -11.133
## `room_type_Hotel room`                        1.515e+03  5.074e+02   2.986
## `room_type_Private room`                     -3.013e+03  1.020e+02 -29.527
## `room_type_Shared room`                      -3.657e+03  3.608e+02 -10.136
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## host_acceptance_rate                         5.75e-09 ***
## host_is_superhost                             < 2e-16 ***
## host_listings_count                          3.99e-07 ***
## host_total_listings_count                    1.58e-12 ***
## host_has_profile_pic                          0.09313 .  
## host_identity_verified                        0.31599    
## latitude                                      < 2e-16 ***
## longitude                                     < 2e-16 ***
## accommodates                                  < 2e-16 ***
## bathrooms_text                                < 2e-16 ***
## bedrooms                                           NA    
## beds                                          < 2e-16 ***
## amenities                                     < 2e-16 ***
## minimum_nights                                0.00560 ** 
## maximum_nights                                0.44812    
## minimum_minimum_nights                        0.34579    
## maximum_minimum_nights                        0.08439 .  
## minimum_maximum_nights                        0.52267    
## maximum_maximum_nights                        0.01890 *  
## minimum_nights_avg_ntm                        0.02511 *  
## maximum_nights_avg_ntm                        0.02894 *  
## has_availability                             5.61e-13 ***
## availability_30                               < 2e-16 ***
## availability_60                               0.47135    
## availability_90                               0.64323    
## availability_365                             9.90e-14 ***
## number_of_reviews                             0.92943    
## number_of_reviews_ltm                         0.13841    
## number_of_reviews_l30d                        0.00192 ** 
## instant_bookable                             1.99e-05 ***
## calculated_host_listings_count                < 2e-16 ***
## calculated_host_listings_count_entire_homes   < 2e-16 ***
## calculated_host_listings_count_private_rooms 2.54e-05 ***
## calculated_host_listings_count_shared_rooms   0.92240    
## reviews_per_month                             < 2e-16 ***
## `room_type_Hotel room`                        0.00283 ** 
## `room_type_Private room`                      < 2e-16 ***
## `room_type_Shared room`                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3797 on 22632 degrees of freedom
## Multiple R-squared:  0.298,  Adjusted R-squared:  0.2969 
## F-statistic: 259.7 on 37 and 22632 DF,  p-value: < 2.2e-16

4.1.2 Stepwise procedure on the model

summary(step_modelo_listing)
## 
## Call:
## lm(formula = price ~ host_acceptance_rate + host_is_superhost + 
##     host_listings_count + host_total_listings_count + latitude + 
##     longitude + accommodates + bathrooms_text + beds + amenities + 
##     minimum_nights + maximum_maximum_nights + minimum_nights_avg_ntm + 
##     maximum_nights_avg_ntm + has_availability + availability_30 + 
##     availability_365 + number_of_reviews_l30d + instant_bookable + 
##     calculated_host_listings_count + calculated_host_listings_count_entire_homes + 
##     calculated_host_listings_count_private_rooms + reviews_per_month + 
##     `room_type_Hotel room` + `room_type_Private room` + `room_type_Shared room`, 
##     data = listing_df_1_dummies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11403.1  -2588.3   -754.7   1802.9  17733.2 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   2.783e+06  1.067e+05  26.076
## host_acceptance_rate                         -7.174e+00  1.175e+00  -6.108
## host_is_superhost                             4.732e+02  5.848e+01   8.092
## host_listings_count                          -1.202e+01  2.372e+00  -5.066
## host_total_listings_count                     9.279e+00  1.314e+00   7.061
## latitude                                      4.464e+04  1.732e+03  25.773
## longitude                                     2.114e+04  1.037e+03  20.390
## accommodates                                  5.203e+02  1.977e+01  26.319
## bathrooms_text                                3.190e+03  8.742e+01  36.494
## beds                                          5.229e+02  4.050e+01  12.913
## amenities                                     3.195e+01  1.814e+00  17.611
## minimum_nights                                7.019e+00  2.552e+00   2.751
## maximum_maximum_nights                        1.636e+00  7.080e-01   2.311
## minimum_nights_avg_ntm                       -1.388e+01  2.736e+00  -5.074
## maximum_nights_avg_ntm                       -1.636e+00  7.080e-01  -2.311
## has_availability                             -1.491e+03  1.986e+02  -7.508
## availability_30                               6.425e+01  2.584e+00  24.866
## availability_365                              1.776e+00  2.151e-01   8.256
## number_of_reviews_l30d                       -8.658e+01  2.276e+01  -3.803
## instant_bookable                              2.553e+02  5.969e+01   4.278
## calculated_host_listings_count               -1.464e+02  1.532e+01  -9.555
## calculated_host_listings_count_entire_homes   1.604e+02  1.533e+01  10.468
## calculated_host_listings_count_private_rooms  8.827e+01  1.944e+01   4.541
## reviews_per_month                            -3.728e+02  2.985e+01 -12.488
## `room_type_Hotel room`                        1.528e+03  4.990e+02   3.061
## `room_type_Private room`                     -3.010e+03  1.017e+02 -29.586
## `room_type_Shared room`                      -3.670e+03  3.092e+02 -11.869
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## host_acceptance_rate                         1.03e-09 ***
## host_is_superhost                            6.15e-16 ***
## host_listings_count                          4.10e-07 ***
## host_total_listings_count                    1.70e-12 ***
## latitude                                      < 2e-16 ***
## longitude                                     < 2e-16 ***
## accommodates                                  < 2e-16 ***
## bathrooms_text                                < 2e-16 ***
## beds                                          < 2e-16 ***
## amenities                                     < 2e-16 ***
## minimum_nights                               0.005952 ** 
## maximum_maximum_nights                       0.020840 *  
## minimum_nights_avg_ntm                       3.93e-07 ***
## maximum_nights_avg_ntm                       0.020840 *  
## has_availability                             6.25e-14 ***
## availability_30                               < 2e-16 ***
## availability_365                              < 2e-16 ***
## number_of_reviews_l30d                       0.000143 ***
## instant_bookable                             1.89e-05 ***
## calculated_host_listings_count                < 2e-16 ***
## calculated_host_listings_count_entire_homes   < 2e-16 ***
## calculated_host_listings_count_private_rooms 5.62e-06 ***
## reviews_per_month                             < 2e-16 ***
## `room_type_Hotel room`                       0.002205 ** 
## `room_type_Private room`                      < 2e-16 ***
## `room_type_Shared room`                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3797 on 22643 degrees of freedom
## Multiple R-squared:  0.2976, Adjusted R-squared:  0.2968 
## F-statistic:   369 on 26 and 22643 DF,  p-value: < 2.2e-16
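
The call that created step_modelo_listing is not shown in this section. As an illustration only, a stepwise selection of this kind can be run with base R's step(). The sketch below uses the built-in mtcars data instead of the Airbnb data, and the penalty k (the chi-squared critical value at the 5% level with 1 degree of freedom) is an assumption made for the example, not necessarily the setting used in this project.

```r
# Illustrative stepwise selection on a built-in dataset (not the Airbnb data).
full_fit <- lm(mpg ~ ., data = mtcars)

# Penalty per parameter: chi-squared critical value at the 5% level, 1 df.
# This is an assumption for the sketch; the default k = 2 gives plain AIC.
k_penalty <- qchisq(p = 0.05, df = 1, lower.tail = FALSE)

# With no `scope`, step() performs backward elimination from the full model.
step_fit <- step(full_fit, k = k_penalty, trace = 0)

# Predictors that survived the selection
formula(step_fit)
```

A larger k removes predictors more aggressively, which is one way a full model with 37 terms can shrink to one with 26.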

Kernel density estimation (KDE)

listing_df_1_dummies %>%
  mutate(residuos = step_modelo_listing$residuals) %>%  # avoid `$` inside aes()
  ggplot() +
  geom_density(aes(x = residuos), fill = "#55C667FF") +
  labs(x = "Stepwise model residuals",
       y = "Density") +
  theme_bw()

### Normality test for the residuals

# Shapiro-Francia normality test. Same computation as nortest::sf.test,
# with the upper sample-size limit raised to 25000 to fit this dataset.
sf_teste <- function (x) 
{
  DNAME <- deparse(substitute(x))
  x <- sort(x[complete.cases(x)])           # drop NAs and order the sample
  n <- length(x)
  if ((n < 5 || n > 25000)) 
    stop("sample size must be between 5 and 25000")
  y <- qnorm(ppoints(n, a = 3/8))           # expected normal order statistics
  W <- cor(x, y)^2                          # test statistic
  u <- log(n)
  v <- log(u)
  mu <- -1.2725 + 1.0521 * (v - u)
  sig <- 1.0308 - 0.26758 * (v + 2/u)
  z <- (log(1 - W) - mu)/sig                # normalizing transformation of W
  pval <- pnorm(z, lower.tail = FALSE)
  RVAL <- list(statistic = c(W = W), p.value = pval, method = "Shapiro-Francia normality test", 
               data.name = DNAME)
  class(RVAL) <- "htest"
  return(RVAL)
}

sf_teste(step_modelo_listing$residuals)
## 
##  Shapiro-Francia normality test
## 
## data:  step_modelo_listing$residuals
## W = 0.94135, p-value < 2.2e-16
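
To give a feel for the W statistic reported above, here is a minimal sketch on simulated data (the samples and the helper name sf_w are made up for illustration). W is the squared correlation between the ordered sample and the expected normal scores, so it approaches 1 when the sample looks normal and drops when the sample is skewed.

```r
set.seed(123)

# Squared correlation between the ordered sample and normal scores:
# the W statistic used by the Shapiro-Francia test.
sf_w <- function(x) {
  x <- sort(x)
  normal_scores <- qnorm(ppoints(length(x), a = 3/8))
  cor(x, normal_scores)^2
}

w_normal <- sf_w(rnorm(1000))  # sample drawn from a normal distribution
w_skewed <- sf_w(rexp(1000))   # right-skewed sample

c(normal = w_normal, skewed = w_skewed)
```

With more than 22,000 residuals the test has a lot of power, so even moderate departures from normality produce a very small p-value.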

4.1.3 Histogram

listing_df_1_dummies %>%
  mutate(residuos = step_modelo_listing$residuals) %>%
  ggplot(aes(x = residuos)) +
  geom_histogram(aes(y = after_stat(density)),  # density scale, so the normal curve can be overlaid
                 color = "white", 
                 fill = "#440154FF", 
                 bins = 30,
                 alpha = 0.6) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(step_modelo_listing$residuals),
                            sd = sd(step_modelo_listing$residuals)),
                linewidth = 2, color = "grey30") +
  labs(x = "Residuals",
       y = "Density") +
  theme_bw()

The Shapiro-Francia test rejects normality of the residuals (W = 0.94135, p-value < 2.2e-16). I will therefore apply a Box-Cox transformation to the dependent variable and fit a new model.

4.1.4 Box-Cox transformation

lambda_BC <- powerTransform(listing_df_1_dummies$price)
lambda_BC
## Estimated transformation parameter 
## listing_df_1_dummies$price 
##                   0.339769

4.1.5 Adding the Box-Cox-transformed price to the dataset to estimate a new model

listing_df_1_dummies$bcprice <- (((listing_df_1_dummies$price ^ lambda_BC$lambda) - 1) / 
                         lambda_BC$lambda)
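
The expression above is the Box-Cox transform (y^lambda - 1) / lambda for lambda different from zero. A small sketch of the transform and its inverse follows; the helper names box_cox and inv_box_cox and the three prices are made up for illustration, while lambda is the value estimated above.

```r
# Box-Cox transform for lambda != 0, and its inverse.
box_cox     <- function(y, lambda) (y^lambda - 1) / lambda
inv_box_cox <- function(z, lambda) (z * lambda + 1)^(1 / lambda)

lambda <- 0.339769              # value estimated by powerTransform() above
prices <- c(1500, 4000, 12000)  # made-up prices, for illustration only

transformed <- box_cox(prices, lambda)
recovered   <- inv_box_cox(transformed, lambda)

# The round trip should return the original prices
all.equal(recovered, prices)
```

The inverse is what maps fitted values of a model on bcprice back to the original price scale.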

4.1.6 Estimating a new multiple regression model with the Box-Cox-transformed dependent variable

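# Note: lm() has no `na.rm` argument, so it is ignored (see the warning below);
# rows with missing values are dropped by lm()'s default na.action (na.omit).
modelo_listing_bc <- lm(formula = bcprice ~ . -price, na.rm = T,
                        data = listing_df_1_dummies)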
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'na.rm' will be disregarded
summary(modelo_listing_bc)
## 
## Call:
## lm(formula = bcprice ~ . - price, data = listing_df_1_dummies, 
##     na.rm = T)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.678  -5.860  -0.882   5.102  39.866 
## 
## Coefficients: (1 not defined because of singularities)
##                                                Estimate Std. Error t value
## (Intercept)                                   6.934e+03  2.416e+02  28.702
## host_acceptance_rate                         -1.079e-02  2.675e-03  -4.035
## host_is_superhost                             1.297e+00  1.359e-01   9.542
## host_listings_count                          -2.429e-02  5.359e-03  -4.532
## host_total_listings_count                     2.070e-02  2.969e-03   6.971
## host_has_profile_pic                         -7.940e-01  4.365e-01  -1.819
## host_identity_verified                       -3.042e-02  1.827e-01  -0.166
## latitude                                      1.119e+02  3.914e+00  28.582
## longitude                                     5.159e+01  2.350e+00  21.953
## accommodates                                  1.256e+00  4.465e-02  28.135
## bathrooms_text                                6.523e+00  1.975e-01  33.026
## bedrooms                                             NA         NA      NA
## beds                                          1.166e+00  9.146e-02  12.750
## amenities                                     8.106e-02  4.149e-03  19.534
## minimum_nights                                1.787e-02  5.768e-03   3.098
## maximum_nights                                3.375e-05  7.113e-05   0.474
## minimum_minimum_nights                        3.773e-02  2.976e-02   1.268
## maximum_minimum_nights                        6.794e-02  3.372e-02   2.015
## minimum_maximum_nights                        1.036e-03  8.486e-04   1.221
## maximum_maximum_nights                        5.090e-03  1.646e-03   3.093
## minimum_nights_avg_ntm                       -1.442e-01  5.259e-02  -2.742
## maximum_nights_avg_ntm                       -6.126e-03  2.017e-03  -3.037
## has_availability                             -3.552e+00  4.531e-01  -7.840
## availability_30                               1.736e-01  1.414e-02  12.279
## availability_60                              -2.588e-02  1.505e-02  -1.720
## availability_90                               1.790e-02  8.007e-03   2.236
## availability_365                              4.353e-03  5.340e-04   8.152
## number_of_reviews                            -5.978e-04  1.967e-03  -0.304
## number_of_reviews_ltm                        -1.074e-02  7.478e-03  -1.437
## number_of_reviews_l30d                       -2.179e-01  5.385e-02  -4.046
## instant_bookable                              5.928e-01  1.353e-01   4.382
## calculated_host_listings_count               -4.477e-01  3.884e-02 -11.526
## calculated_host_listings_count_entire_homes   4.788e-01  3.888e-02  12.317
## calculated_host_listings_count_private_rooms  3.066e-01  4.704e-02   6.517
## calculated_host_listings_count_shared_rooms  -6.129e-03  1.128e-01  -0.054
## reviews_per_month                            -7.747e-01  7.145e-02 -10.842
## `room_type_Hotel room`                        2.651e+00  1.145e+00   2.314
## `room_type_Private room`                     -8.956e+00  2.304e-01 -38.881
## `room_type_Shared room`                      -1.150e+01  8.146e-01 -14.118
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## host_acceptance_rate                         5.48e-05 ***
## host_is_superhost                             < 2e-16 ***
## host_listings_count                          5.88e-06 ***
## host_total_listings_count                    3.23e-12 ***
## host_has_profile_pic                          0.06890 .  
## host_identity_verified                        0.86777    
## latitude                                      < 2e-16 ***
## longitude                                     < 2e-16 ***
## accommodates                                  < 2e-16 ***
## bathrooms_text                                < 2e-16 ***
## bedrooms                                           NA    
## beds                                          < 2e-16 ***
## amenities                                     < 2e-16 ***
## minimum_nights                                0.00195 ** 
## maximum_nights                                0.63517    
## minimum_minimum_nights                        0.20482    
## maximum_minimum_nights                        0.04390 *  
## minimum_maximum_nights                        0.22217    
## maximum_maximum_nights                        0.00198 ** 
## minimum_nights_avg_ntm                        0.00611 ** 
## maximum_nights_avg_ntm                        0.00239 ** 
## has_availability                             4.70e-15 ***
## availability_30                               < 2e-16 ***
## availability_60                               0.08550 .  
## availability_90                               0.02536 *  
## availability_365                             3.77e-16 ***
## number_of_reviews                             0.76114    
## number_of_reviews_ltm                         0.15082    
## number_of_reviews_l30d                       5.22e-05 ***
## instant_bookable                             1.18e-05 ***
## calculated_host_listings_count                < 2e-16 ***
## calculated_host_listings_count_entire_homes   < 2e-16 ***
## calculated_host_listings_count_private_rooms 7.34e-11 ***
## calculated_host_listings_count_shared_rooms   0.95666    
## reviews_per_month                             < 2e-16 ***
## `room_type_Hotel room`                        0.02065 *  
## `room_type_Private room`                      < 2e-16 ***
## `room_type_Shared room`                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.573 on 22632 degrees of freedom
## Multiple R-squared:  0.3416, Adjusted R-squared:  0.3405 
## F-statistic: 317.4 on 37 and 22632 DF,  p-value: < 2.2e-16

Stepwise procedure on the Box-Cox model

summary(step_modelo_listing_bc)
## 
## Call:
## lm(formula = bcprice ~ host_acceptance_rate + host_is_superhost + 
##     host_listings_count + host_total_listings_count + latitude + 
##     longitude + accommodates + bathrooms_text + beds + amenities + 
##     minimum_nights + maximum_maximum_nights + minimum_nights_avg_ntm + 
##     maximum_nights_avg_ntm + has_availability + availability_30 + 
##     availability_365 + number_of_reviews_l30d + instant_bookable + 
##     calculated_host_listings_count + calculated_host_listings_count_entire_homes + 
##     calculated_host_listings_count_private_rooms + reviews_per_month + 
##     `room_type_Hotel room` + `room_type_Private room` + `room_type_Shared room`, 
##     data = listing_df_1_dummies, na.rm = T)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.904  -5.871  -0.879   5.089  40.235 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   6.892e+03  2.410e+02  28.595
## host_acceptance_rate                         -1.097e-02  2.652e-03  -4.135
## host_is_superhost                             1.240e+00  1.320e-01   9.393
## host_listings_count                          -2.416e-02  5.356e-03  -4.511
## host_total_listings_count                     2.060e-02  2.967e-03   6.942
## latitude                                      1.115e+02  3.911e+00  28.502
## longitude                                     5.113e+01  2.341e+00  21.836
## accommodates                                  1.254e+00  4.464e-02  28.082
## bathrooms_text                                6.525e+00  1.974e-01  33.057
## beds                                          1.165e+00  9.144e-02  12.744
## amenities                                     8.018e-02  4.097e-03  19.572
## minimum_nights                                1.765e-02  5.762e-03   3.063
## maximum_maximum_nights                        4.727e-03  1.599e-03   2.957
## minimum_nights_avg_ntm                       -3.847e-02  6.178e-03  -6.227
## maximum_nights_avg_ntm                       -4.727e-03  1.599e-03  -2.957
## has_availability                             -3.543e+00  4.484e-01  -7.901
## availability_30                               1.684e-01  5.834e-03  28.862
## availability_365                              4.846e-03  4.856e-04   9.979
## number_of_reviews_l30d                       -2.428e-01  5.140e-02  -4.724
## instant_bookable                              5.732e-01  1.348e-01   4.253
## calculated_host_listings_count               -4.491e-01  3.460e-02 -12.982
## calculated_host_listings_count_entire_homes   4.799e-01  3.461e-02  13.868
## calculated_host_listings_count_private_rooms  3.074e-01  4.389e-02   7.003
## reviews_per_month                            -8.183e-01  6.741e-02 -12.139
## `room_type_Hotel room`                        2.612e+00  1.127e+00   2.318
## `room_type_Private room`                     -8.980e+00  2.297e-01 -39.091
## `room_type_Shared room`                      -1.153e+01  6.982e-01 -16.512
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## host_acceptance_rate                         3.56e-05 ***
## host_is_superhost                             < 2e-16 ***
## host_listings_count                          6.48e-06 ***
## host_total_listings_count                    3.96e-12 ***
## latitude                                      < 2e-16 ***
## longitude                                     < 2e-16 ***
## accommodates                                  < 2e-16 ***
## bathrooms_text                                < 2e-16 ***
## beds                                          < 2e-16 ***
## amenities                                     < 2e-16 ***
## minimum_nights                                0.00220 ** 
## maximum_maximum_nights                        0.00311 ** 
## minimum_nights_avg_ntm                       4.82e-10 ***
## maximum_nights_avg_ntm                        0.00311 ** 
## has_availability                             2.89e-15 ***
## availability_30                               < 2e-16 ***
## availability_365                              < 2e-16 ***
## number_of_reviews_l30d                       2.33e-06 ***
## instant_bookable                             2.12e-05 ***
## calculated_host_listings_count                < 2e-16 ***
## calculated_host_listings_count_entire_homes   < 2e-16 ***
## calculated_host_listings_count_private_rooms 2.58e-12 ***
## reviews_per_month                             < 2e-16 ***
## `room_type_Hotel room`                        0.02045 *  
## `room_type_Private room`                      < 2e-16 ***
## `room_type_Shared room`                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.575 on 22643 degrees of freedom
## Multiple R-squared:  0.3411, Adjusted R-squared:  0.3403 
## F-statistic: 450.8 on 26 and 22643 DF,  p-value: < 2.2e-16

4.1.7 Shapiro-Francia test

sf_teste(step_modelo_listing_bc$residuals)
## 
##  Shapiro-Francia normality test
## 
## data:  step_modelo_listing_bc$residuals
## W = 0.98678, p-value < 2.2e-16

4.1.8 Heteroskedasticity diagnosis for the stepwise model with Box-Cox

ols_test_breusch_pagan(step_modelo_listing_bc)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : bcprice 
##  Variables: fitted values of bcprice 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    166.4628 
##  Prob > Chi2   =    4.383087e-38

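Besides the residuals still not adhering to normality, the Breusch-Pagan test rejects the hypothesis of constant variance (Chi2 = 166.4628, p-value = 4.38e-38). The residuals are therefore heteroskedastic, which may indicate that variables relevant for explaining Y were omitted from the model.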
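
As a rough sketch of what a Breusch-Pagan statistic measures, the example below computes one textbook form of it by hand on simulated data whose error variance grows with the predictor. The data and variable names are made up, and this is not claimed to reproduce the exact internals of ols_test_breusch_pagan().

```r
set.seed(42)

n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)  # error sd grows with x: heteroskedastic

fit <- lm(y ~ x)

# Squared residuals scaled by their mean (the ML estimate of the variance)
e2 <- residuals(fit)^2
g  <- e2 / mean(e2)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(g ~ fitted(fit))

# Statistic: half the explained sum of squares, chi-squared with 1 df
ess     <- sum((fitted(aux) - mean(g))^2)
bp_stat <- ess / 2
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value means the squared residuals vary systematically with the fitted values, which is the pattern the test above detected for the Box-Cox model.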

4.1.9 Summary of the two models obtained by the stepwise procedure (linear and with Box-Cox)

export_summs(step_modelo_listing, step_modelo_listing_bc,
             model.names = c("Modelo Linear","Modelo Box-Cox"),
             scale = F, digits = 6)
## Registered S3 methods overwritten by 'broom':
##   method            from  
##   tidy.glht         jtools
##   tidy.summary.glht jtools
                                              Modelo Linear        Modelo Box-Cox
(Intercept)                                   2783283.930766 ***   6891.580161 ***
                                              (106736.510416)      (241.009163)
host_acceptance_rate                          -7.174236 ***        -0.010967 ***
                                              (1.174658)           (0.002652)
host_is_superhost                             473.219863 ***       1.240267 ***
                                              (58.477537)          (0.132041)
host_listings_count                           -12.016408 ***       -0.024163 ***
                                              (2.372168)           (0.005356)
host_total_listings_count                     9.278918 ***         0.020599 ***
                                              (1.314082)           (0.002967)
latitude                                      44635.301751 ***     111.459061 ***
                                              (1731.896440)        (3.910592)
longitude                                     21143.923598 ***     51.128885 ***
                                              (1036.981220)        (2.341485)
accommodates                                  520.280855 ***       1.253501 ***
                                              (19.768464)          (0.044637)
bathrooms_text                                3190.209610 ***      6.525012 ***
                                              (87.417937)          (0.197388)
beds                                          522.931394 ***       1.165353 ***
                                              (40.497830)          (0.091443)
amenities                                     31.952626 ***        0.080181 ***
                                              (1.814364)           (0.004097)
minimum_nights                                7.019189 **          0.017648 **
                                              (2.551827)           (0.005762)
maximum_maximum_nights                        1.636109 *           0.004727 **
                                              (0.707957)           (0.001599)
minimum_nights_avg_ntm                        -13.883070 ***       -0.038474 ***
                                              (2.736124)           (0.006178)
maximum_nights_avg_ntm                        -1.636106 *          -0.004727 **
                                              (0.707957)           (0.001599)
has_availability                              -1491.016008 ***     -3.543232 ***
                                              (198.602532)         (0.448441)
availability_30                               64.248017 ***        0.168383 ***
                                              (2.583720)           (0.005834)
availability_365                              1.775529 ***         0.004846 ***
                                              (0.215067)           (0.000486)
number_of_reviews_l30d                        -86.579008 ***       -0.242810 ***
                                              (22.763789)          (0.051400)
instant_bookable                              255.337518 ***       0.573175 ***
                                              (59.685222)          (0.134768)
calculated_host_listings_count                -146.397867 ***      -0.449140 ***
                                              (15.322387)          (0.034598)
calculated_host_listings_count_entire_homes   160.431316 ***       0.479931 ***
                                              (15.326519)          (0.034607)
calculated_host_listings_count_private_rooms  88.273160 ***        0.307351 ***
                                              (19.437400)          (0.043889)
reviews_per_month                             -372.799997 ***      -0.818256 ***
                                              (29.852683)          (0.067407)
`room_type_Hotel room`                        1527.619752 **       2.611745 *
                                              (498.979792)         (1.126688)
`room_type_Private room`                      -3010.019883 ***     -8.979945 ***
                                              (101.736309)         (0.229719)
`room_type_Shared room`                       -3670.118672 ***     -11.529063 ***
                                              (309.219994)         (0.698213)
N                                             22670                22670
R2                                            0.297627             0.341093
Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05.

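I will add the fitted values (Yhat) from the stepwise model and from the stepwise + Box-Cox model to the dataset for comparison. The Box-Cox fitted values are converted back to the original price scale.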

listing_df$yhat_step_listing <- step_modelo_listing$fitted.values
listing_df$yhat_step_modelo_bc <- (((step_modelo_listing_bc$fitted.values*(lambda_BC$lambda))+
                                      1))^(1/(lambda_BC$lambda))

4.1.10 Viewing the two sets of fitted values in the dataset

listing_df %>%
  select(price, yhat_step_listing, yhat_step_modelo_bc) %>%
  DT::datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

4.1.11 Model fit: predicted values (fitted values) versus actual values

listing_df %>%
  ggplot() +
  geom_smooth(aes(x = price, y = yhat_step_listing , color = "Stepwise"),
              method = "lm", se = F, formula = y ~ splines::bs(x, df = 5), linewidth = 1.5) +
  geom_point(aes(x = price, y = yhat_step_listing),
             color = "#440154FF", alpha = 0.6, size = 2) +
  geom_smooth(aes(x = price, y = yhat_step_modelo_bc, color = "Stepwise Box-Cox"),
              method = "lm", se = F, formula = y ~ splines::bs(x, df = 5), linewidth = 1.5) +
  geom_point(aes(x = price, y = yhat_step_modelo_bc),
             color = "#287D8EFF", alpha = 0.6, size = 2) +
  # 45-degree reference line: fitted value equal to the actual price
  geom_smooth(aes(x = price, y = price), method = "lm", formula = y ~ x,
              color = "grey30", linewidth = 1.05,
              linetype = "longdash") +
  # Named values so each line has the same color as its points
  scale_color_manual("Models:", 
                     values = c("Stepwise" = "#440154FF",
                                "Stepwise Box-Cox" = "#287D8EFF")) +
  labs(x = "price", y = "Fitted Values") +
  theme(panel.background = element_rect("white"),
        panel.grid = element_line("grey95"),
        panel.border = element_rect(NA),
        legend.position = "bottom")
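
The two R-squared values reported earlier are measured on different scales (price and bcprice), so a complementary check is to compare prediction error on the original price scale. The sketch below shows one way to do that with RMSE. The helper rmse and the toy data frame are made up for illustration; in the project the same call would be applied to the price, yhat_step_listing and yhat_step_modelo_bc columns of listing_df.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Toy data frame standing in for listing_df (all values are made up)
toy <- data.frame(
  price               = c(3000, 5000, 8000, 12000),
  yhat_step_listing   = c(4200, 5600, 7100, 9000),
  yhat_step_modelo_bc = c(3500, 5200, 7400, 10500)
)

c(stepwise        = rmse(toy$price, toy$yhat_step_listing),
  stepwise_boxcox = rmse(toy$price, toy$yhat_step_modelo_bc))
```

A lower RMSE on the price scale would support the claim that one model predicts better, independently of how the response was transformed.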

# Conclusion

  • Top 3 neighbourhoods by number of listings: Palermo, Recoleta and San Nicolas.
  • Top 5 most expensive neighbourhoods: Puerto Madero, Villa Real, Palermo, Floresta and Recoleta. Of the three neighbourhoods with the most listings, two (Palermo and Recoleta) are also among the most expensive.
  • If you want to make your trip cheaper, I recommend shared rooms or private rooms, which showed the most affordable prices.
  • The model with the greatest predictive capacity is step_modelo_listing_bc (R2 = 0.3411, versus 0.2976 for the linear stepwise model).